An Integrated Architecture for Processing Business Documents in Turkish

Internal identifier: 000A15 (Main/Exploration); previous: 000A14; next: 000A16

Authors: Serif Adali [Turkey]; Coskun Sonmez [Turkey]; Mehmet Gokturk [Turkey]

Source:

Lecture Notes in Computer Science [0302-9743]; 2009.

RBID : ISTEX:EC872692031D3D2184E99A18DBC3D8E352AAB509

Abstract

This paper covers the first research activity in the field of automatic processing of business documents in Turkish. In contrast to traditional information extraction systems, which process input text as a linear sequence of words and focus on semantic aspects, the proposed approach takes document layout information into account and benefits from hints provided by layout analysis. In addition, the approach not only checks relations between entities across the document to verify its integrity, but also verifies extracted information against real-world data (e.g. a customer database). This rule-based approach uses a morphological analyzer for Turkish, a lexicon-integrated domain ontology, a document layout analyzer, an extraction ontology, and a template mining module. Because it is based on an extraction ontology, conceptual sentence analysis increases portability: it requires only domain concepts, whereas other information extraction systems rely on large sets of linguistic patterns.

Url:

https://api.istex.fr/document/EC872692031D3D2184E99A18DBC3D8E352AAB509/fulltext/pdf

DOI: 10.1007/978-3-642-00382-0_32

Affiliations:

Turkey

Links toward previous steps (curation, corpus...)

to stream Istex, to step Corpus: 000957
to stream Istex, to step Curation: 000947
to stream Istex, to step Checkpoint: 000537
to stream Main, to step Merge: 000A23
to stream Main, to step Curation: 000A15

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">An Integrated Architecture for Processing Business Documents in Turkish</title>
<author><name sortKey="Adali, Serif" sort="Adali, Serif" uniqKey="Adali S" first="Serif" last="Adali">Serif Adali</name>
</author>
<author><name sortKey="Sonmez, Coskun" sort="Sonmez, Coskun" uniqKey="Sonmez C" first="Coskun" last="Sonmez">Coskun Sonmez</name>
</author>
<author><name sortKey="Gokturk, Mehmet" sort="Gokturk, Mehmet" uniqKey="Gokturk M" first="Mehmet" last="Gokturk">Mehmet Gokturk</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:EC872692031D3D2184E99A18DBC3D8E352AAB509</idno>
<date when="2009" year="2009">2009</date>
<idno type="doi">10.1007/978-3-642-00382-0_32</idno>
<idno type="url">https://api.istex.fr/document/EC872692031D3D2184E99A18DBC3D8E352AAB509/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000957</idno>
<idno type="wicri:Area/Istex/Curation">000947</idno>
<idno type="wicri:Area/Istex/Checkpoint">000537</idno>
<idno type="wicri:doubleKey">0302-9743:2009:Adali S:an:integrated:architecture</idno>
<idno type="wicri:Area/Main/Merge">000A23</idno>
<idno type="wicri:Area/Main/Curation">000A15</idno>
<idno type="wicri:Area/Main/Exploration">000A15</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">An Integrated Architecture for Processing Business Documents in Turkish</title>
<author><name sortKey="Adali, Serif" sort="Adali, Serif" uniqKey="Adali S" first="Serif" last="Adali">Serif Adali</name>
<affiliation wicri:level="1"><country xml:lang="fr">Turquie</country>
<wicri:regionArea>Department of Computer Engineering, Istanbul Technical University, 34469, Istanbul</wicri:regionArea>
<wicri:noRegion>Istanbul</wicri:noRegion>
</affiliation>
<affiliation><wicri:noCountry code="no comma">E-mail: serifadali@yahoo.com</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Sonmez, Coskun" sort="Sonmez, Coskun" uniqKey="Sonmez C" first="Coskun" last="Sonmez">Coskun Sonmez</name>
<affiliation wicri:level="1"><country xml:lang="fr">Turquie</country>
<wicri:regionArea>Department of Computer Engineering, Yildiz Technical University, 34349, Istanbul</wicri:regionArea>
<wicri:noRegion>Istanbul</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Turquie</country>
</affiliation>
</author>
<author><name sortKey="Gokturk, Mehmet" sort="Gokturk, Mehmet" uniqKey="Gokturk M" first="Mehmet" last="Gokturk">Mehmet Gokturk</name>
<affiliation wicri:level="1"><country xml:lang="fr">Turquie</country>
<wicri:regionArea>Department of Computer Engineering, Gebze Institute of Technology, 41400, Kocaeli</wicri:regionArea>
<wicri:noRegion>Kocaeli</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Turquie</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2009</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">EC872692031D3D2184E99A18DBC3D8E352AAB509</idno>
<idno type="DOI">10.1007/978-3-642-00382-0_32</idno>
<idno type="ChapterID">32</idno>
<idno type="ChapterID">Chap32</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: This paper covers the first research activity in the field of automatic processing of business documents in Turkish. In contrast to traditional information extraction systems, which process input text as a linear sequence of words and focus on semantic aspects, the proposed approach takes document layout information into account and benefits from hints provided by layout analysis. In addition, the approach not only checks relations between entities across the document to verify its integrity, but also verifies extracted information against real-world data (e.g. a customer database). This rule-based approach uses a morphological analyzer for Turkish, a lexicon-integrated domain ontology, a document layout analyzer, an extraction ontology, and a template mining module. Because it is based on an extraction ontology, conceptual sentence analysis increases portability: it requires only domain concepts, whereas other information extraction systems rely on large sets of linguistic patterns.</div>
</front>
</TEI>
<affiliations><list><country><li>Turquie</li>
</country>
</list>
<tree><country name="Turquie"><noRegion><name sortKey="Adali, Serif" sort="Adali, Serif" uniqKey="Adali S" first="Serif" last="Adali">Serif Adali</name>
</noRegion>
<name sortKey="Gokturk, Mehmet" sort="Gokturk, Mehmet" uniqKey="Gokturk M" first="Mehmet" last="Gokturk">Mehmet Gokturk</name>
<name sortKey="Gokturk, Mehmet" sort="Gokturk, Mehmet" uniqKey="Gokturk M" first="Mehmet" last="Gokturk">Mehmet Gokturk</name>
<name sortKey="Sonmez, Coskun" sort="Sonmez, Coskun" uniqKey="Sonmez C" first="Coskun" last="Sonmez">Coskun Sonmez</name>
<name sortKey="Sonmez, Coskun" sort="Sonmez, Coskun" uniqKey="Sonmez C" first="Coskun" last="Sonmez">Coskun Sonmez</name>
</country>
</tree>
</affiliations>
</record>
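
As a rough illustration of how a record like the one above could be read programmatically, the following Python sketch extracts the title, authors and DOI. It is not part of Dilib. The record uses the `wicri:` prefix without declaring it, so the sketch wraps the fragment in a root element that binds the prefix to a placeholder URI (`urn:example:wicri`, an assumption made only so the parser accepts the input). The function name `parse_record` and the trimmed `SAMPLE` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption): the record never declares "wicri:",
# so any URI works here; it only serves to make the fragment parseable.
WICRI_NS = "urn:example:wicri"

# Trimmed copy of the record above, kept short for illustration.
SAMPLE = """<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc>
<publicationStmt><idno type="doi">10.1007/978-3-642-00382-0_32</idno></publicationStmt>
<sourceDesc><biblStruct><analytic>
<title level="a" type="main" xml:lang="en">An Integrated Architecture for Processing Business Documents in Turkish</title>
<author><name sortKey="Adali, Serif">Serif Adali</name></author>
<author><name sortKey="Sonmez, Coskun">Coskun Sonmez</name></author>
</analytic></biblStruct></sourceDesc>
</fileDesc></teiHeader></TEI></record>"""


def parse_record(record_xml: str) -> dict:
    """Return the title, author names and DOI found in a <record> fragment."""
    # Wrap the fragment so the undeclared "wicri:" prefix is bound to a namespace.
    wrapped = f'<root xmlns:wicri="{WICRI_NS}">{record_xml}</root>'
    root = ET.fromstring(wrapped)
    analytic = root.find(".//analytic")
    return {
        "title": analytic.findtext("title"),
        "authors": [name.text for name in analytic.findall("author/name")],
        "doi": root.findtext(".//publicationStmt/idno[@type='doi']"),
    }


if __name__ == "__main__":
    print(parse_record(SAMPLE))
```

The same function should work on the full record, provided it is passed as a single string without an XML declaration.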

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000A15 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000A15 | SxmlIndent | more
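
For scripted use, the pipeline above can be assembled from Python. The sketch below only builds the command string, reusing the `-h` and `-nk` options exactly as they appear in the commands above. The helper name `build_select_command` and the example directory are hypothetical, and running the result requires a machine where Dilib is installed.

```python
import shlex


def build_select_command(explor_step: str, key: str) -> str:
    """Build the HfdSelect | SxmlIndent pipeline for one record key."""
    # biblio.hfd sits directly under the exploration step directory.
    hfd_path = f"{explor_step.rstrip('/')}/biblio.hfd"
    # Quote both arguments so unusual characters cannot break the shell command.
    return (
        f"HfdSelect -h {shlex.quote(hfd_path)} "
        f"-nk {shlex.quote(key)} | SxmlIndent"
    )


if __name__ == "__main__":
    # Hypothetical directory; substitute the real value of $EXPLOR_STEP.
    print(build_select_command("/data/OcrV1/Data/Main/Exploration", "000A15"))
```

The resulting string can be handed to `subprocess.run(..., shell=True)` to run the pipeline.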

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:EC872692031D3D2184E99A18DBC3D8E352AAB509
   |texte=   An Integrated Architecture for Processing Business Documents in Turkish
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

Exploration server on OCR
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
